A Statistical Corpus-Based Term Extractor

Authors

  • Patrick Pantel
  • Dekang Lin
Abstract

Term extraction is an important problem in natural language processing. In this paper, we propose a language-independent, statistical, corpus-based term extraction algorithm. In previous approaches, evaluation has been subjective, at best relying on a lexicographer’s judgement. First, we evaluate the quality of our term extractor by assessing its predictiveness on an unseen corpus using perplexity. Second, we evaluate the precision and recall of our extractor by comparing the Chinese words in a segmented corpus with the words extracted by our system.
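
The second evaluation mentioned above can be illustrated with a minimal sketch. It assumes precision and recall are computed over sets of distinct word types, which the abstract does not specify; the function name `precision_recall` and the sample word lists are hypothetical and are not taken from the paper.

```python
def precision_recall(extracted, reference):
    """Compare extracted words against reference words from a segmented corpus.

    Assumption: both inputs are treated as sets of distinct word types.
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    # Precision: fraction of extracted words that appear in the reference.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of reference words that the extractor found.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: words from a segmented corpus vs. extractor output.
reference_words = {"北京", "大学", "学生", "研究"}
extracted_words = {"北京", "大学", "京大", "研究"}

p, r = precision_recall(extracted_words, reference_words)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four extracted words are in the reference and three of the four reference words are recovered, so both scores are 0.75 for this toy input.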

Similar resources

Comparative Evaluation of C-value in the Treatment of Nested Terms

In statistical term extraction systems, the identification and selection of nested term candidates often presents a challenge. The paper presents an implementation and evaluation of C-value, a heuristic that ranks and/or discards nested terms according to their stability in the corpus. The method was tested for English and Slovene; for both, the overall performance of the term extractor improved ...

An Efficient Patent Keyword Extractor As Translation Resource

The paper addresses the issue of resource reuse in patent translation. It presents an efficient patent keyword/phrase extraction tool and illustrates how the tool can be used in patent translation by both human experts and MT developers. The keyword extraction is based on a new hybrid methodology providing for intelligent output and computationally attractive properties. The tool is composed of...

Improving Term Extraction with Terminological Resources

Studies of different term extractors on a corpus of the biomedical domain revealed decreasing performances when applied to highly technical texts. Facing the difficulty or impossibility to customize existing tools, we developed a tunable term extractor. It exploits linguistic-based rules in combination with the reuse of existing terminologies, i.e. exogenous disambiguation. Experiments reported...

Concept Mining: A Conceptual Understanding based Approach

Due to the rapid daily growth of information, there is a considerable need to extract and discover valuable knowledge from data sources such as the World Wide Web. Most of the common techniques in text mining are based on the statistical analysis of a term, either a word or a phrase. These techniques consider documents as bags of words and pay no attention to the meanings of the document content...

Design of a Extraction System for Definitional Contexts from Biomedical Corpora

In this paper we present general progress on the design of a methodology for extracting definitional contexts from a corpus of biomedicine in Spanish, taking into account a set of processes performed by the following modules: (i) a term extractor based on a hybrid method, (ii) a set of verbs that configure the syntactic structure of a definitional context, (iii) a chunker able to recognize those...

Publication date: 2001